R Basics

From simple addition to data frames, graphs, and cleaning data sets

Basic Building Blocks

In [4]:
5+7
12
In [5]:
x<-5+7 #x equals five plus seven
x
12
In [6]:
y<-x-3
y
9

Any object that contains data is called a data structure, and numeric vectors are the simplest type of data structure in R.

The easiest way to create a vector is with the c() function, which stands for 'concatenate' or 'combine'.

In [8]:
z<-c(1,2.2,3)
z
  1. 1
  2. 2.2
  3. 3
In [9]:
#for help, ?function_name opens the documentation for that function
?c
In [10]:
#combining the vectors 
c(z,555,z)
  1. 1
  2. 2.2
  3. 3
  4. 555
  5. 1
  6. 2.2
  7. 3
In [11]:
#operations on vector 
z*2+100
  1. 102
  2. 104.4
  3. 106

Other common arithmetic operators are +, -, /, and ^ (where x^2 means 'x squared'). To take the square root, use the sqrt() function and to take the absolute value, use the abs() function.

In [14]:
my_sqrt<-sqrt(z-1)
my_sqrt
  1. 0
  2. 1.09544511501033
  3. 1.4142135623731
In [15]:
my_div<-z/my_sqrt
my_div
  1. Inf
  2. 2.00831604418561
  3. 2.12132034355964
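The cells above use sqrt() but not ^ or abs(). A minimal sketch of those two, using a made-up vector v (not part of the original notebook). It also shows where the Inf in my_div comes from: the first element of my_sqrt is 0, and dividing a positive number by zero gives Inf in R.

```r
v <- c(-3, 2, -1.5)          # hypothetical example vector

squares    <- v^2            # element-wise square: 9, 4, 2.25
magnitudes <- abs(v)         # element-wise absolute value: 3, 2, 1.5
roots      <- sqrt(squares)  # square root of a square equals the absolute value

print(squares)
print(magnitudes)
print(roots)

# Dividing a positive number by zero yields Inf rather than an error
print(1 / 0)
```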

When given two vectors of the same length, R simply performs the specified arithmetic operation (+, -, *, etc.) element-by-element. If the vectors are of different lengths, R 'recycles' the shorter vector until it is the same length as the longer vector.

In [16]:
#example
c(1,2,3,4)+c(0,10)
  1. 1
  2. 12
  3. 3
  4. 14
In [17]:
#in case the longer vector's length is not a multiple of the shorter vector's length
c(1,2,3,4)+c(0,10,100)
Warning message in c(1, 2, 3, 4) + c(0, 10, 100):
“longer object length is not a multiple of shorter object length”
  1. 1
  2. 12
  3. 103
  4. 4
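One way to see what recycling does is to spell it out by hand. The sketch below uses rep_len() to stretch the shorter vector to the length of the longer one and checks that the result matches what R computes on its own. The variable names are illustrative only.

```r
long  <- c(1, 2, 3, 4)
short <- c(0, 10)

# Repeat the shorter vector until it has as many elements as the longer one
recycled <- rep_len(short, length(long))   # 0 10 0 10

explicit <- long + recycled   # recycling written out by hand
implicit <- long + short      # recycling done by R

print(recycled)
print(identical(explicit, implicit))

# The three-element case from the warning example: the last copy is cut short
partial <- rep_len(c(0, 10, 100), length(long))   # 0 10 100 0
print(long + partial)
```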

Workspace and Files

In [19]:
#Determine which directory your R session is using as its current working directory using getwd().
getwd()
'/Users/prashanth/DS-14.310x'
In [20]:
# List all the objects in your local workspace using ls()
ls()
  1. 'my_div'
  2. 'my_sqrt'
  3. 'x'
  4. 'y'
  5. 'z'
In [22]:
#List all the files in your working directory using list.files() or dir().
list.files()
  1. '14 310x Intro to R.ipynb'
  2. '14_310x_Intro_to_R_.zip'
  3. 'RStudio-1.1.453.dmg'
In [23]:
dir()
  1. '14 310x Intro to R.ipynb'
  2. '14_310x_Intro_to_R_.zip'
  3. 'RStudio-1.1.453.dmg'
In [26]:
#Using the args() function on a function name is also a handy way to see what arguments a function can take.
args(list.files)
function (path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE, 
    recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, 
    no.. = FALSE) 
NULL
In [27]:
#Use dir.create() to create a directory in the current working directory called "testdir".
dir.create("testdir")
In [28]:
#Set your working directory to "testdir" with the setwd() command.
setwd("testdir")
In [29]:
#Create a file in your working directory called "mytest.R" using the file.create() function.
file.create("mytest.R")
TRUE
In [31]:
list.files()
'mytest.R'
In [32]:
#Check to see if "mytest.R" exists in the working directory using the file.exists() function.
file.exists("mytest.R")
TRUE
In [33]:
#Access information about the file "mytest.R" by using file.info().
file.info("mytest.R")
         size isdir mode               mtime               ctime               atime uid gid     uname grname
mytest.R    0 FALSE  644 2018-06-06 09:55:04 2018-06-06 09:55:04 2018-06-06 09:55:04 501  20 prashanth  staff
In [34]:
#You can use the $ operator --- e.g., file.info("mytest.R")$mode --- to grab specific items.
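The cell above describes the $ operator without running it. A small sketch of the idea, assuming a throwaway file created with tempfile() so that nothing in the working directory is touched (the temporary file is not part of the original walkthrough):

```r
tmp <- tempfile(fileext = ".R")   # hypothetical scratch file
file.create(tmp)

info <- file.info(tmp)            # a data frame with one row per file

# $ pulls a single column out of that data frame
file_size <- info$size            # a newly created file is empty
is_dir    <- info$isdir

print(file_size)
print(is_dir)

unlink(tmp)                       # clean up the scratch file
```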
In [35]:
#Change the name of the file "mytest.R" to "mytest2.R" by using file.rename().
file.rename("mytest.R","mytest2.R")
TRUE
In [36]:
#Make a copy of "mytest2.R" called "mytest3.R" using file.copy().
file.copy("mytest2.R","mytest3.R")
TRUE
In [37]:
#Provide the relative path to the file "mytest3.R" by using file.path().
file.path("mytest3.R")
'mytest3.R'
In [38]:
#You can use file.path to construct file and directory paths that are independent of the operating system
#your R code is running on. Pass 'folder1' and 'folder2' as arguments to file.path to make a
# platform-independent pathname.
file.path("folder1","folder2")
'folder1/folder2'
In [39]:
# Create a directory in the current working directory called "testdir2" and a subdirectory for it called
# "testdir3", all in one command by using dir.create() and file.path().
dir.create(file.path("testdir2","testdir3"), recursive = TRUE)
In [41]:
# To delete a directory you need to use the recursive = TRUE argument with the function unlink(). If you
# don't use recursive = TRUE, R is concerned that you're unaware that you're deleting a directory and all
# of its contents. R reasons that, if you don't specify that recursive equals TRUE, you don't know that
# something is in the directory you're trying to delete. R tries to prevent you from making a mistake.
unlink("testdir2") #doesn't work: without recursive = TRUE the directory is left in place
In [43]:
unlink("testdir2", recursive = TRUE) #works
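Neither unlink() call above prints anything, so it is hard to see which one actually deleted the directory. The sketch below checks with dir.exists() after each call. It works inside tempdir() with a made-up directory name, so it is an illustration rather than a replay of the cells above.

```r
top <- file.path(tempdir(), "demo_dir")            # hypothetical directory
dir.create(file.path(top, "inner"), recursive = TRUE)

unlink(top)                          # no recursive = TRUE: nothing is deleted
still_there <- dir.exists(top)
print(still_there)

unlink(top, recursive = TRUE)        # removes the directory and its contents
gone <- !dir.exists(top)
print(gone)
```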
In [45]:
getwd()
'/Users/prashanth/DS-14.310x/testdir'
In [46]:
setwd('/Users/prashanth/DS-14.310x')
In [47]:
getwd()
'/Users/prashanth/DS-14.310x'
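The cells above get back to the original directory by typing its full path. A sketch of an alternative that avoids hard-coding the path: setwd() invisibly returns the directory you were in before the change, so that value can be saved and reused. tempdir() stands in here for whatever directory you want to visit.

```r
start_wd <- getwd()

old_wd <- setwd(tempdir())   # setwd() returns the previous working directory
# ... work inside the temporary directory ...
setwd(old_wd)                # go back to where we started

back_home <- identical(getwd(), start_wd)
print(back_home)
```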
In [48]:
list.files()
  1. '14 310x Intro to R.ipynb'
  2. '14_310x_Intro_to_R_.zip'
  3. 'RStudio-1.1.453.dmg'
  4. 'testdir'
In [49]:
#Delete the 'testdir' directory that you just left (and everything in it)
unlink("testdir", recursive = TRUE)

Sequences of Numbers

In [50]:
#The simplest way to create a sequence of numbers in R is by using the `:` operator. Type 1:20 to see how it works.
1:20
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
In [51]:
pi:10
  1. 3.14159265358979
  2. 4.14159265358979
  3. 5.14159265358979
  4. 6.14159265358979
  5. 7.14159265358979
  6. 8.14159265358979
  7. 9.14159265358979
In [52]:
15:1
  1. 15
  2. 14
  3. 13
  4. 12
  5. 11
  6. 10
  7. 9
  8. 8
  9. 7
  10. 6
  11. 5
  12. 4
  13. 3
  14. 2
  15. 1
In [53]:
#Operators have documentation too. Wrap the operator in backticks to pull it up:
?`:`
In [54]:
# Often, we'll desire more control over a sequence we're creating than what the `:` operator gives us. The
# seq() function serves this purpose.
seq(1,20)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
In [55]:
seq(1,10, by=0.5) #by half increments
  1. 1
  2. 1.5
  3. 2
  4. 2.5
  5. 3
  6. 3.5
  7. 4
  8. 4.5
  9. 5
  10. 5.5
  11. 6
  12. 6.5
  13. 7
  14. 7.5
  15. 8
  16. 8.5
  17. 9
  18. 9.5
  19. 10
In [57]:
my_seq<-seq(5, 10,length=30)
my_seq #30 evenly spaced values from 5 to 10 (29 equal steps)
  1. 5
  2. 5.17241379310345
  3. 5.3448275862069
  4. 5.51724137931035
  5. 5.68965517241379
  6. 5.86206896551724
  7. 6.03448275862069
  8. 6.20689655172414
  9. 6.37931034482759
  10. 6.55172413793103
  11. 6.72413793103448
  12. 6.89655172413793
  13. 7.06896551724138
  14. 7.24137931034483
  15. 7.41379310344828
  16. 7.58620689655172
  17. 7.75862068965517
  18. 7.93103448275862
  19. 8.10344827586207
  20. 8.27586206896552
  21. 8.44827586206897
  22. 8.62068965517241
  23. 8.79310344827586
  24. 8.96551724137931
  25. 9.13793103448276
  26. 9.31034482758621
  27. 9.48275862068965
  28. 9.6551724137931
  29. 9.82758620689655
  30. 10
In [59]:
length(my_seq)
30
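To make the spacing of my_seq concrete: asking for 30 values between 5 and 10 gives 29 equal steps of (10 - 5) / 29 each. The sketch below checks this with diff(). It spells the argument out as length.out, the full name that the shorter length= in the cell above partially matches.

```r
n <- 30
s <- seq(5, 10, length.out = n)

step <- (10 - 5) / (n - 1)   # 29 gaps between 30 values
gaps <- diff(s)              # differences between neighbouring values

print(step)
print(length(gaps))
print(isTRUE(all.equal(gaps, rep(step, n - 1))))
```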
In [60]:
1:length(my_seq)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
In [66]:
seq(along.with=my_seq)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
In [68]:
seq_along(my_seq)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
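The three approaches above give the same result for my_seq, but they differ when the vector is empty. A short sketch of that edge case, using an empty vector made up for the purpose:

```r
empty <- numeric(0)

# 1:length(empty) is 1:0, which counts DOWN and yields two values
bad <- 1:length(empty)

# seq_along() returns an empty integer vector, which is usually what a loop needs
good <- seq_along(empty)

print(bad)
print(good)
```

This is the usual reason for preferring seq_along(x) over 1:length(x) when writing loops.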
In [69]:
#One more function related to creating sequences of numbers is rep(), which stands for 'replicate'. Let's 
#look at a few uses.
rep(0, times= 40)
  1. 0
  2. 0
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0
  11. 0
  12. 0
  13. 0
  14. 0
  15. 0
  16. 0
  17. 0
  18. 0
  19. 0
  20. 0
  21. 0
  22. 0
  23. 0
  24. 0
  25. 0
  26. 0
  27. 0
  28. 0
  29. 0
  30. 0
  31. 0
  32. 0
  33. 0
  34. 0
  35. 0
  36. 0
  37. 0
  38. 0
  39. 0
  40. 0
In [70]:
rep(c(0,1,2), times = 10)#repeating vector
  1. 0
  2. 1
  3. 2
  4. 0
  5. 1
  6. 2
  7. 0
  8. 1
  9. 2
  10. 0
  11. 1
  12. 2
  13. 0
  14. 1
  15. 2
  16. 0
  17. 1
  18. 2
  19. 0
  20. 1
  21. 2
  22. 0
  23. 1
  24. 2
  25. 0
  26. 1
  27. 2
  28. 0
  29. 1
  30. 2
In [71]:
rep(c(0, 1, 2), each = 10) # repeat each number 10 times
  1. 0
  2. 0
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0
  11. 1
  12. 1
  13. 1
  14. 1
  15. 1
  16. 1
  17. 1
  18. 1
  19. 1
  20. 1
  21. 2
  22. 2
  23. 2
  24. 2
  25. 2
  26. 2
  27. 2
  28. 2
  29. 2
  30. 2

Vectors

The simplest and most common data structure in R is the vector. Vectors come in two different flavors: atomic vectors and lists. An atomic vector contains exactly one data type, whereas a list may contain multiple data types.

Types of atomic vectors include numeric, logical, character, integer, and complex. Logical vectors can contain the values TRUE, FALSE, and NA (for 'not available'). These values are generated as the result of logical 'conditions'.
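A quick sketch of the difference between an atomic vector and a list. When values of different types are combined with c(), R coerces them to a single common type, whereas list() keeps each element's own type. The variable names are illustrative.

```r
# c() builds an atomic vector: everything is coerced to character here
mixed_atomic <- c(1, "a", TRUE)
print(typeof(mixed_atomic))
print(mixed_atomic)

# list() keeps the original type of every element
mixed_list <- list(1, "a", TRUE)
element_types <- sapply(mixed_list, typeof)
print(element_types)
```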

In [72]:
num_vect<-c(0.5,55,-10,6)
tf <- num_vect < 1
tf
  1. TRUE
  2. FALSE
  3. TRUE
  4. FALSE
In [73]:
num_vect>=6
  1. FALSE
  2. TRUE
  3. FALSE
  4. TRUE

The < and >= symbols in these examples are called 'logical operators'. Other logical operators include >, <=, == for exact equality, and != for inequality.

If we have two logical expressions, A and B, we can ask whether at least one is TRUE with A | B (logical 'or' a.k.a. 'union') or whether they are both TRUE with A & B (logical 'and' a.k.a. 'intersection'). Lastly, !A is the negation of A and is TRUE when A is FALSE and vice versa.

In [75]:
(3 > 5) & (4 == 4)
FALSE
In [76]:
(TRUE == TRUE) | (TRUE == FALSE)
TRUE
In [74]:
((111 >= 111) | !(TRUE)) & ((4 + 1) == 5)
TRUE
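The examples above combine single TRUE/FALSE values. The same operators also work element by element on whole vectors, as sketched here with the num_vect values from earlier (redefined so the block runs on its own):

```r
num_vect <- c(0.5, 55, -10, 6)

either  <- (num_vect < 1) | (num_vect >= 6)   # at least one condition holds
both    <- (num_vect > 0) & (num_vect < 10)   # both conditions hold
negated <- !(num_vect < 1)                    # flips every element

print(either)
print(both)
print(negated)
```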
In [78]:
# Create a character vector that contains the following words: "My", "name", "is". Remember to enclose each
# word in its own set of double quotes, so that R knows they are character strings. Store the vector in a
# variable called my_char.
my_char<-c("My","name","is")
my_char
  1. 'My'
  2. 'name'
  3. 'is'
In [81]:
paste(my_char, collapse = " ")#combines the strings in a vector
'My name is'
In [85]:
my_name=c(my_char,"chika chika slam shady") #appends a new element to the character vector
my_name
  1. 'My'
  2. 'name'
  3. 'is'
  4. 'chika chika slam shady'
In [86]:
paste(my_name, collapse = " ")
'My name is chika chika slam shady'
In [87]:
paste("Hello", "world!", sep = " ")
'Hello world!'
In [88]:
paste(1:3,c("X","Y","Z"),sep="") #integers and characters: the numbers are coerced to strings
  1. '1X'
  2. '2Y'
  3. '3Z'
In [89]:
#Try paste(LETTERS, 1:4, sep = "-"), where LETTERS is a predefined variable in R 
# containing a character vector of all 26 letters in the English alphabet.
paste(LETTERS, 1:4, sep = "-")
  1. 'A-1'
  2. 'B-2'
  3. 'C-3'
  4. 'D-4'
  5. 'E-1'
  6. 'F-2'
  7. 'G-3'
  8. 'H-4'
  9. 'I-1'
  10. 'J-2'
  11. 'K-3'
  12. 'L-4'
  13. 'M-1'
  14. 'N-2'
  15. 'O-3'
  16. 'P-4'
  17. 'Q-1'
  18. 'R-2'
  19. 'S-3'
  20. 'T-4'
  21. 'U-1'
  22. 'V-2'
  23. 'W-3'
  24. 'X-4'
  25. 'Y-1'
  26. 'Z-2'

Note: If the lengths of the vectors are not equal, the shorter vector is recycled (repeated) until it matches the length of the longer one.


Missing Values

Missing values play an important role in statistics and data analysis. Often, missing values must not be ignored, but rather they should be carefully studied to see if there's an underlying pattern or cause for their missingness.

In R, NA is used to represent any value that is 'not available' or 'missing' (in the statistical sense). In this lesson, we'll explore missing values further. Any operation involving NA generally yields NA as the result.

In [90]:
x<-c(44,NA,5,NA)
x*3
  1. 132
  2. <NA>
  3. 15
  4. <NA>
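Because NA propagates, summary functions such as mean() and sum() also return NA when any element is missing. Many of them accept na.rm = TRUE to drop the missing values first. A minimal sketch with the same x as above:

```r
x <- c(44, NA, 5, NA)

plain_mean   <- mean(x)                  # NA, because x contains missing values
cleaned_mean <- mean(x, na.rm = TRUE)    # mean of 44 and 5
cleaned_sum  <- sum(x, na.rm = TRUE)     # 44 + 5

print(plain_mean)
print(cleaned_mean)
print(cleaned_sum)
```

Whether dropping missing values is appropriate depends on why they are missing, as noted at the start of this section.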
In [93]:
y <- rnorm(1000) # vector containing 1000 draws from a standard normal distribution
z<- rep(NA, 1000) # vector of NA's
my_data <- sample(c(y,z), 100) #random sample of 100 values drawn from the two vectors combined
my_na <- is.na(my_data) #TRUE where the value is NA, FALSE otherwise
my_na
  1. FALSE
  2. FALSE
  3. TRUE
  4. TRUE
  5. FALSE
  6. FALSE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE
  11. TRUE
  12. TRUE
  13. TRUE
  14. FALSE
  15. FALSE
  16. FALSE
  17. TRUE
  18. FALSE
  19. TRUE
  20. FALSE
  21. FALSE
  22. TRUE
  23. TRUE
  24. FALSE
  25. FALSE
  26. FALSE
  27. TRUE
  28. FALSE
  29. FALSE
  30. TRUE
  31. TRUE
  32. TRUE
  33. FALSE
  34. TRUE
  35. TRUE
  36. FALSE
  37. TRUE
  38. TRUE
  39. TRUE
  40. FALSE
  41. TRUE
  42. TRUE
  43. TRUE
  44. FALSE
  45. FALSE
  46. TRUE
  47. TRUE
  48. FALSE
  49. TRUE
  50. TRUE
  51. TRUE
  52. TRUE
  53. TRUE
  54. TRUE
  55. FALSE
  56. TRUE
  57. TRUE
  58. FALSE
  59. TRUE
  60. TRUE
  61. TRUE
  62. FALSE
  63. TRUE
  64. TRUE
  65. FALSE
  66. FALSE
  67. TRUE
  68. FALSE
  69. FALSE
  70. TRUE
  71. TRUE
  72. TRUE
  73. FALSE
  74. FALSE
  75. FALSE
  76. TRUE
  77. TRUE
  78. FALSE
  79. TRUE
  80. TRUE
  81. FALSE
  82. FALSE
  83. FALSE
  84. TRUE
  85. FALSE
  86. FALSE
  87. TRUE
  88. FALSE
  89. TRUE
  90. FALSE
  91. FALSE
  92. TRUE
  93. FALSE
  94. TRUE
  95. TRUE
  96. FALSE
  97. TRUE
  98. TRUE
  99. TRUE
  100. TRUE
In [95]:
my_data == NA # won't work: comparing anything with NA yields NA, so this returns a vector of NAs as long as my_data. Use is.na() instead.
  1. <NA>
  2. <NA>
  3. <NA>
  4. <NA>
  5. <NA>
  6. <NA>
  7. <NA>
  8. <NA>
  9. <NA>
  10. <NA>
  11. <NA>
  12. <NA>
  13. <NA>
  14. <NA>
  15. <NA>
  16. <NA>
  17. <NA>
  18. <NA>
  19. <NA>
  20. <NA>
  21. <NA>
  22. <NA>
  23. <NA>
  24. <NA>
  25. <NA>
  26. <NA>
  27. <NA>
  28. <NA>
  29. <NA>
  30. <NA>
  31. <NA>
  32. <NA>
  33. <NA>
  34. <NA>
  35. <NA>
  36. <NA>
  37. <NA>
  38. <NA>
  39. <NA>
  40. <NA>
  41. <NA>
  42. <NA>
  43. <NA>
  44. <NA>
  45. <NA>
  46. <NA>
  47. <NA>
  48. <NA>
  49. <NA>
  50. <NA>
  51. <NA>
  52. <NA>
  53. <NA>
  54. <NA>
  55. <NA>
  56. <NA>
  57. <NA>
  58. <NA>
  59. <NA>
  60. <NA>
  61. <NA>
  62. <NA>
  63. <NA>
  64. <NA>
  65. <NA>
  66. <NA>
  67. <NA>
  68. <NA>
  69. <NA>
  70. <NA>
  71. <NA>
  72. <NA>
  73. <NA>
  74. <NA>
  75. <NA>
  76. <NA>
  77. <NA>
  78. <NA>
  79. <NA>
  80. <NA>
  81. <NA>
  82. <NA>
  83. <NA>
  84. <NA>
  85. <NA>
  86. <NA>
  87. <NA>
  88. <NA>
  89. <NA>
  90. <NA>
  91. <NA>
  92. <NA>
  93. <NA>
  94. <NA>
  95. <NA>
  96. <NA>
  97. <NA>
  98. <NA>
  99. <NA>
  100. <NA>
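The reason == fails here is that NA stands for an unknown value, so R cannot say whether it equals anything, not even another NA. A short sketch on a three-element vector (made up for illustration) alongside the functions that do work:

```r
v <- c(1, NA, 3)

na_vs_na   <- NA == NA      # NA: two unknown values cannot be compared
by_equals  <- v == NA       # NA NA NA
by_is_na   <- is.na(v)      # FALSE TRUE FALSE
has_any_na <- anyNA(v)      # TRUE if at least one element is missing

print(na_vs_na)
print(by_equals)
print(by_is_na)
print(has_any_na)
```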

Underneath the surface, R represents TRUE as the number 1 and FALSE as the number 0, so summing a logical vector counts its TRUE values.

In [96]:
sum(my_na) #the sum counts the TRUE values, i.e. how many NAs are in my_data
56
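The same TRUE-as-1 behaviour means mean() of a logical vector gives the proportion of TRUE values. A small sketch with a hand-written logical vector, since my_data above is random and will differ from run to run:

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical is.na() result

how_many   <- sum(flags)    # number of TRUE values
proportion <- mean(flags)   # share of TRUE values

print(how_many)
print(proportion)
print(TRUE + TRUE)          # logicals are treated as 1 and 0 in arithmetic
```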
In [97]:
#let's look at a second type of missing value -- NaN, which stands for 'not a number'.
0/0
NaN
In [98]:
Inf-Inf #Inf stands for infinity
NaN
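NA and NaN are related but not the same thing, and the test functions treat them differently: is.na() is TRUE for both, while is.nan() is TRUE only for NaN. A brief sketch on a made-up vector:

```r
vals <- c(1, NA, NaN, Inf)

missing_or_nan <- is.na(vals)      # TRUE for NA and for NaN
only_nan       <- is.nan(vals)     # TRUE for NaN only
finite_vals    <- is.finite(vals)  # FALSE for NA, NaN and Inf

print(missing_or_nan)
print(only_nan)
print(finite_vals)
```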

Subsetting Vectors

In this lesson, we'll see how to extract elements from a vector based on some conditions that we specify.

In [101]:
x <- rep(c(NA,2.5,NA,-1),10) #sample vector
x
  1. <NA>
  2. 2.5
  3. <NA>
  4. -1
  5. <NA>
  6. 2.5
  7. <NA>
  8. -1
  9. <NA>
  10. 2.5
  11. <NA>
  12. -1
  13. <NA>
  14. 2.5
  15. <NA>
  16. -1
  17. <NA>
  18. 2.5
  19. <NA>
  20. -1
  21. <NA>
  22. 2.5
  23. <NA>
  24. -1
  25. <NA>
  26. 2.5
  27. <NA>
  28. -1
  29. <NA>
  30. 2.5
  31. <NA>
  32. -1
  33. <NA>
  34. 2.5
  35. <NA>
  36. -1
  37. <NA>
  38. 2.5
  39. <NA>
  40. -1
In [103]:
#The way you tell R that you want to select some particular elements
#(i.e. a 'subset') from a vector is by placing an 'index vector' in
#square brackets immediately following the name of the vector.
In [102]:
x[1:10] #first 10 elements
  1. <NA>
  2. 2.5
  3. <NA>
  4. -1
  5. <NA>
  6. 2.5
  7. <NA>
  8. -1
  9. <NA>
  10. 2.5
In [104]:
# Index vectors come in four different flavors -- logical vectors, vectors
# of positive integers, vectors of negative integers, and vectors of
# character strings
In [105]:
x[is.na(x)] # gives all NA's in vector
  1. <NA>
  2. <NA>
  3. <NA>
  4. <NA>
  5. <NA>
  6. <NA>
  7. <NA>
  8. <NA>
  9. <NA>
  10. <NA>
  11. <NA>
  12. <NA>
  13. <NA>
  14. <NA>
  15. <NA>
  16. <NA>
  17. <NA>
  18. <NA>
  19. <NA>
  20. <NA>
In [107]:
y<-x[!is.na(x)]
y #'!' negates is.na(), so this keeps all the non-NA elements
  1. 2.5
  2. -1
  3. 2.5
  4. -1
  5. 2.5
  6. -1
  7. 2.5
  8. -1
  9. 2.5
  10. -1
  11. 2.5
  12. -1
  13. 2.5
  14. -1
  15. 2.5
  16. -1
  17. 2.5
  18. -1
  19. 2.5
  20. -1
In [108]:
y[y>0] #all y values where y>0
  1. 2.5
  2. 2.5
  3. 2.5
  4. 2.5
  5. 2.5
  6. 2.5
  7. 2.5
  8. 2.5
  9. 2.5
  10. 2.5
In [109]:
x[!is.na(x) & x>0] # combination of above commands
  1. 2.5
  2. 2.5
  3. 2.5
  4. 2.5
  5. 2.5
  6. 2.5
  7. 2.5
  8. 2.5
  9. 2.5
  10. 2.5
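The !is.na(x) part of the last command matters. Since NA > 0 is itself NA, indexing with x > 0 alone keeps an NA for every missing element. The sketch below shows this on a four-element version of x, along with which(), which returns only the positions where the condition is TRUE:

```r
x <- c(NA, 2.5, NA, -1)

with_nas    <- x[x > 0]               # NA 2.5 NA: the NAs leak through
using_which <- x[which(x > 0)]        # which() ignores NA, leaving 2.5
using_is_na <- x[!is.na(x) & x > 0]   # same result, as in the cell above

print(with_nas)
print(using_which)
print(using_is_na)
```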

Many programming languages use what's called 'zero-based indexing', which means that the first element of a vector is considered element 0. R uses 'one-based indexing', which (you guessed it!) means the first element of a vector is considered element 1.

In [111]:
x[1] #is the 1st element
<NA>
In [114]:
x[c(3,4,7)] #3rd 4th and 7th element
  1. <NA>
  2. -1
  3. <NA>
In [115]:
x[0] #returns an empty vector (numeric(0)), not an error
In [117]:
x[3000] #an out-of-range index gives NA rather than an error, so be careful about the length of the vector
<NA>
In [118]:
x[c(-2,-10)] #gives all the elements other than the 2nd and 10th
  1. <NA>
  2. <NA>
  3. -1
  4. <NA>
  5. 2.5
  6. <NA>
  7. -1
  8. <NA>
  9. <NA>
  10. -1
  11. <NA>
  12. 2.5
  13. <NA>
  14. -1
  15. <NA>
  16. 2.5
  17. <NA>
  18. -1
  19. <NA>
  20. 2.5
  21. <NA>
  22. -1
  23. <NA>
  24. 2.5
  25. <NA>
  26. -1
  27. <NA>
  28. 2.5
  29. <NA>
  30. -1
  31. <NA>
  32. 2.5
  33. <NA>
  34. -1
  35. <NA>
  36. 2.5
  37. <NA>
  38. -1
In [119]:
x[-c(2,10)] #equivalent to the command above
  1. <NA>
  2. <NA>
  3. -1
  4. <NA>
  5. 2.5
  6. <NA>
  7. -1
  8. <NA>
  9. <NA>
  10. -1
  11. <NA>
  12. 2.5
  13. <NA>
  14. -1
  15. <NA>
  16. 2.5
  17. <NA>
  18. -1
  19. <NA>
  20. 2.5
  21. <NA>
  22. -1
  23. <NA>
  24. 2.5
  25. <NA>
  26. -1
  27. <NA>
  28. 2.5
  29. <NA>
  30. -1
  31. <NA>
  32. 2.5
  33. <NA>
  34. -1
  35. <NA>
  36. 2.5
  37. <NA>
  38. -1
In [120]:
vect <- c(foo = 11, bar= 2, norf=NA) #a named vector
vect
foo
11
bar
2
norf
<NA>
In [121]:
names(vect) #gives all the names
  1. 'foo'
  2. 'bar'
  3. 'norf'
In [122]:
vect2 <- c(11,2,NA) #creating the vector 
names(vect2) <- c("foo","bar","norf") #assigning the names
vect2
foo
11
bar
2
norf
<NA>
In [124]:
identical(vect,vect2) #checks for identical vectors
TRUE
In [125]:
vect["bar"] #selecting based on name
bar: 2
In [126]:
vect[c("foo","bar")] #multiple selection based on name
foo
11
bar
2

Matrices and Data Frames

In this lesson, we'll cover matrices and data frames. Both represent 'rectangular' data types, meaning that they are used to store tabular data, with rows and columns.

In [127]:
# The main difference, as you'll see, is that matrices can only contain a
# single class of data, while data frames can consist of many different
# classes of data.
In [128]:
my_vector <- 1:20
my_vector
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
In [130]:
dim(my_vector) #vector has no dimensions
NULL
In [132]:
length(my_vector) #but it has length
20
In [134]:
dim(my_vector)<- c(4,5) #assigning dimensions: 4 rows and 5 columns
my_vector
1  5  9 13 17
2  6 10 14 18
3  7 11 15 19
4  8 12 16 20
In [135]:
dim(my_vector)
  1. 4
  2. 5
In [136]:
attributes(my_vector)
$dim =
  1. 4
  2. 5
In [138]:
class(my_vector) #now its class is matrix (R 4.0 and later also report 'array')
'matrix'
In [139]:
my_matrix <- my_vector
my_matrix2 = matrix(data=1:20, nrow= 4, ncol= 5) #another way of creating the matrix
my_matrix2
1  5  9 13 17
2  6 10 14 18
3  7 11 15 19
4  8 12 16 20
In [140]:
identical(my_matrix,my_matrix2)
TRUE
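Both ways of building the matrix fill it column by column: 1 to 4 go down the first column, 5 to 8 down the second, and so on. A sketch of how that compares with filling row by row, using the byrow argument of matrix() (the variable names are illustrative):

```r
by_col <- matrix(1:20, nrow = 4, ncol = 5)                 # default: fill columns first
by_row <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE)   # fill rows first

print(by_col[, 1])    # first column of the column-filled matrix: 1 2 3 4
print(by_row[1, ])    # first row of the row-filled matrix: 1 2 3 4 5

# Same position, different value depending on the fill order
print(by_col[2, 3])
print(by_row[2, 3])
```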
In [142]:
patients<- c("Bill","Gina","Kelly","Sean")
cbind(patients,my_matrix) #coerces every element of the matrix to a character string, which is no good for working with numbers.
                            #This is called 'implicit coercion', because we didn't ask for it.
patients
Bill     1 5 9  13 17
Gina     2 6 10 14 18
Kelly    3 7 11 15 19
Sean     4 8 12 16 20
In [144]:
#A better way is to use a data frame, which keeps the numbers numeric
my_data <- data.frame(patients,my_matrix)
my_data
patients X1 X2 X3 X4 X5
Bill     1  5  9  13 17
Gina     2  6  10 14 18
Kelly    3  7  11 15 19
Sean     4  8  12 16 20
In [150]:
# Behind the scenes, the data.frame() function takes any number of
# arguments and returns a single object of class `data.frame` that is
# composed of the original objects.
class(my_data)
'data.frame'
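The printed tables from cbind() and data.frame() look alike, so the coercion is easy to miss. A sketch that checks the stored types directly: the objects are rebuilt here so the block runs on its own, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
patients <- c("Bill", "Gina", "Kelly", "Sean")
m <- matrix(1:20, nrow = 4, ncol = 5)

combined <- cbind(patients, m)    # a matrix, so everything becomes character
print(typeof(combined))

df <- data.frame(patients, m, stringsAsFactors = FALSE)
col_classes <- sapply(df, class)  # one class per column
print(col_classes)
```

In the data frame the patients column stays character and the five matrix columns stay integer, which is why arithmetic on them still works.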
In [149]:
cnames <- c("patient","age","weight","bp","rating","test")
colnames(my_data) <- cnames #assigning column names to the data frame
my_data
patient age weight bp rating test
Bill    1   5      9  13     17
Gina    2   6      10 14     18
Kelly   3   7      11 15     19
Sean    4   8      12 16     20

Looking at Data

In [151]:
# Whenever you're working with a new dataset, the first thing you should
# do is look at it! What is the format of the data? What are the
# dimensions? What are the variable names? How are the variables stored?
# Are there missing data? Are there any flaws in the data?
In [153]:
laliga=read.csv("SP1.csv")
In [155]:
ls()
  1. 'cnames'
  2. 'laliga'
  3. 'my_char'
  4. 'my_data'
  5. 'my_div'
  6. 'my_matrix'
  7. 'my_matrix2'
  8. 'my_na'
  9. 'my_name'
  10. 'my_seq'
  11. 'my_sqrt'
  12. 'my_vector'
  13. 'num_vect'
  14. 'P'
  15. 'patients'
  16. 'Q'
  17. 'tf'
  18. 'vect'
  19. 'vect2'
  20. 'x'
  21. 'y'
  22. 'z'
In [156]:
class(laliga) #object type
'data.frame'
In [157]:
dim(laliga) #dimensions
  1. 380
  2. 64
In [158]:
nrow(laliga) # number of rows
380
In [159]:
ncol(laliga) #number of columns
64
In [160]:
object.size(laliga) #memory the object occupies, in bytes
177552 bytes
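Two of the questions from the start of this section, how the variables are stored and whether any data are missing, can be answered with str() and a count of NAs per column. The sketch below uses a tiny made-up data frame in place of SP1.csv. The column names are borrowed from laliga, but the LBH values are invented for illustration.

```r
# Hypothetical stand-in for the real data set
matches <- data.frame(
  HomeTeam = c("Leganes", "Valencia", "Celta"),
  FTHG     = c(1, 1, 2),
  LBH      = c(2.1, NA, 2.3),       # invented values, one deliberately missing
  stringsAsFactors = FALSE
)

str(matches)                           # type and first few values of each column

na_counts <- colSums(is.na(matches))   # number of missing values per column
print(na_counts)
```

The same two calls should work unchanged on laliga, where summary() reports one NA in each of the LBH, LBD and LBA columns.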
In [162]:
names(laliga) #column names
  1. 'Div'
  2. 'Date'
  3. 'HomeTeam'
  4. 'AwayTeam'
  5. 'FTHG'
  6. 'FTAG'
  7. 'FTR'
  8. 'HTHG'
  9. 'HTAG'
  10. 'HTR'
  11. 'HS'
  12. 'AS'
  13. 'HST'
  14. 'AST'
  15. 'HF'
  16. 'AF'
  17. 'HC'
  18. 'AC'
  19. 'HY'
  20. 'AY'
  21. 'HR'
  22. 'AR'
  23. 'B365H'
  24. 'B365D'
  25. 'B365A'
  26. 'BWH'
  27. 'BWD'
  28. 'BWA'
  29. 'IWH'
  30. 'IWD'
  31. 'IWA'
  32. 'LBH'
  33. 'LBD'
  34. 'LBA'
  35. 'PSH'
  36. 'PSD'
  37. 'PSA'
  38. 'WHH'
  39. 'WHD'
  40. 'WHA'
  41. 'VCH'
  42. 'VCD'
  43. 'VCA'
  44. 'Bb1X2'
  45. 'BbMxH'
  46. 'BbAvH'
  47. 'BbMxD'
  48. 'BbAvD'
  49. 'BbMxA'
  50. 'BbAvA'
  51. 'BbOU'
  52. 'BbMx.2.5'
  53. 'BbAv.2.5'
  54. 'BbMx.2.5.1'
  55. 'BbAv.2.5.1'
  56. 'BbAH'
  57. 'BbAHh'
  58. 'BbMxAHH'
  59. 'BbAvAHH'
  60. 'BbMxAHA'
  61. 'BbAvAHA'
  62. 'PSCH'
  63. 'PSCD'
  64. 'PSCA'
In [163]:
head(laliga) #first 6 rows by default
Div Date     HomeTeam    AwayTeam    FTHG FTAG FTR HTHG HTAG HTR ... BbAv.2.5.1 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH  PSCD PSCA
SP1 18/08/17 Leganes     Alaves      1    0    H   1    0    H   ... 1.46       18   -0.50 2.07    2.03    1.90    1.86    1.98  3.35 4.63
SP1 18/08/17 Valencia    Las Palmas  1    0    H   1    0    H   ... 2.27       16   -0.75 2.05    1.97    1.96    1.91    1.78  4.24 4.43
SP1 19/08/17 Celta       Sociedad    2    3    A   1    1    D   ... 1.84       18   -0.25 2.08    2.05    1.87    1.83    2.12  3.53 3.74
SP1 19/08/17 Girona      Ath Madrid  2    2    D   2    0    H   ... 1.74       16   1.25  1.77    1.75    2.25    2.16    6.93  3.83 1.63
SP1 19/08/17 Sevilla     Espanol     1    1    D   1    1    D   ... 2.09       16   -1.00 2.12    2.06    1.86    1.82    1.64  4.18 5.82
SP1 20/08/17 Ath Bilbao  Getafe      0    0    D   0    0    D   ... 1.87       17   -1.00 1.90    1.86    2.05    2.01    1.53  4.48 6.91
In [164]:
head(laliga,10) #first 10 rows
Div Date     HomeTeam    AwayTeam    FTHG FTAG FTR HTHG HTAG HTR ... BbAv.2.5.1 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH  PSCD PSCA
SP1 18/08/17 Leganes     Alaves      1    0    H   1    0    H   ... 1.46       18   -0.50 2.07    2.03    1.90    1.86    1.98  3.35 4.63
SP1 18/08/17 Valencia    Las Palmas  1    0    H   1    0    H   ... 2.27       16   -0.75 2.05    1.97    1.96    1.91    1.78  4.24 4.43
SP1 19/08/17 Celta       Sociedad    2    3    A   1    1    D   ... 1.84       18   -0.25 2.08    2.05    1.87    1.83    2.12  3.53 3.74
SP1 19/08/17 Girona      Ath Madrid  2    2    D   2    0    H   ... 1.74       16   1.25  1.77    1.75    2.25    2.16    6.93  3.83 1.63
SP1 19/08/17 Sevilla     Espanol     1    1    D   1    1    D   ... 2.09       16   -1.00 2.12    2.06    1.86    1.82    1.64  4.18 5.82
SP1 20/08/17 Ath Bilbao  Getafe      0    0    D   0    0    D   ... 1.87       17   -1.00 1.90    1.86    2.05    2.01    1.53  4.48 6.91
SP1 20/08/17 Barcelona   Betis       2    0    H   2    0    H   ... 2.88       17   -2.00 2.05    2.00    1.91    1.86    1.20  8.25 15.20
SP1 20/08/17 La Coruna   Real Madrid 0    3    A   0    2    A   ... 2.64       16   1.50  2.03    1.98    1.95    1.89    12.40 7.00 1.26
SP1 21/08/17 Levante     Villarreal  1    0    H   0    0    D   ... 1.58       15   0.25  1.93    1.89    2.03    1.98    3.31  3.32 2.40
SP1 21/08/17 Malaga      Eibar       0    1    A   0    0    D   ... 1.70       17   -0.25 1.92    1.88    2.04    1.99    2.20  3.27 3.85
In [165]:
tail(laliga,15) #last 15 rows
    Div Date     HomeTeam    AwayTeam    FTHG FTAG FTR HTHG HTAG HTR ... BbAv.2.5.1 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH  PSCD PSCA
366 SP1 12/05/18 La Coruna   Villarreal  2    4    A   0    3    A   ... 2.15       17   0.25  2.19    2.11    1.81    1.76    4.71  4.30 1.71
367 SP1 12/05/18 Real Madrid Celta       6    0    H   3    0    H   ... 3.94       19   -1.50 1.96    1.91    2.00    1.95    1.25  6.89 11.56
368 SP1 12/05/18 Sociedad    Leganes     3    2    H   2    1    H   ... 2.04       19   -1.25 2.06    1.99    1.91    1.87    1.40  4.87 9.26
369 SP1 13/05/18 Espanol     Malaga      4    1    H   3    1    H   ... 1.74       19   -0.75 1.86    1.83    2.07    2.03    1.63  3.97 6.24
370 SP1 13/05/18 Levante     Barcelona   5    4    H   2    1    H   ... 3.23       18   1.50  2.11    2.06    1.86    1.81    7.70  5.40 1.40
371 SP1 19/05/18 Celta       Levante     4    2    H   2    1    H   ... 2.82       20   -0.75 2.05    2.01    1.90    1.85    1.60  4.68 5.33
372 SP1 19/05/18 Las Palmas  Girona      1    2    A   1    2    A   ... 2.36       19   0.25  1.94    1.90    2.00    1.95    3.44  4.01 2.06
373 SP1 19/05/18 Leganes     Betis       3    2    H   1    1    D   ... 2.19       17   0.00  2.04    1.99    1.91    1.86    2.41  3.76 2.90
374 SP1 19/05/18 Malaga      Getafe      0    1    A   0    0    D   ... 1.91       19   0.25  1.98    1.92    1.98    1.93    3.26  3.56 2.28
375 SP1 19/05/18 Sevilla     Alaves      1    0    H   1    0    H   ... 2.86       19   -1.25 1.97    1.91    2.00    1.94    1.32  6.09 9.47
376 SP1 19/05/18 Villarreal  Real Madrid 2    2    D   0    2    A   ... 3.79       19   0.25  2.05    1.98    1.93    1.87    4.74  5.05 1.62
377 SP1 20/05/18 Ath Bilbao  Espanol     0    1    A   0    1    A   ... 2.06       17   -0.50 2.06    2.02    1.88    1.85    1.95  3.77 4.05
378 SP1 20/05/18 Ath Madrid  Eibar       2    2    D   1    1    D   ... 1.98       19   -1.00 2.09    2.04    1.87    1.82    1.47  4.25 8.80
379 SP1 20/05/18 Barcelona   Sociedad    1    0    H   0    0    D   ... 5.04       19   -2.00 1.94    1.89    2.03    1.97    1.31  6.40 8.60
380 SP1 20/05/18 Valencia    La Coruna   2    1    H   1    0    H   ... 2.98       19   -1.50 2.01    1.97    1.94    1.89    1.27  6.44 10.71
In [166]:
summary(laliga) #summary!!!
  Div            Date           HomeTeam         AwayTeam        FTHG      
 SP1:380   12/05/18:  8   Alaves    : 19   Alaves    : 19   Min.   :0.000  
           19/05/18:  6   Ath Bilbao: 19   Ath Bilbao: 19   1st Qu.:0.750  
           01/04/18:  5   Ath Madrid: 19   Ath Madrid: 19   Median :1.000  
           01/10/17:  5   Barcelona : 19   Barcelona : 19   Mean   :1.547  
           03/03/18:  5   Betis     : 19   Betis     : 19   3rd Qu.:2.000  
           05/11/17:  5   Celta     : 19   Celta     : 19   Max.   :7.000  
           (Other) :346   (Other)   :266   (Other)   :266                  
      FTAG       FTR          HTHG             HTAG        HTR    
 Min.   :0.000   A:115   Min.   :0.0000   Min.   :0.0000   A: 93  
 1st Qu.:0.000   D: 86   1st Qu.:0.0000   1st Qu.:0.0000   D:159  
 Median :1.000   H:179   Median :0.0000   Median :0.0000   H:128  
 Mean   :1.147           Mean   :0.6605   Mean   :0.4868          
 3rd Qu.:2.000           3rd Qu.:1.0000   3rd Qu.:1.0000          
 Max.   :6.000           Max.   :5.0000   Max.   :3.0000          
                                                                  
       HS              AS             HST              AST        
 Min.   : 2.00   Min.   : 1.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:10.00   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.: 2.000  
 Median :13.00   Median :10.00   Median : 4.500   Median : 3.000  
 Mean   :13.53   Mean   :10.47   Mean   : 4.758   Mean   : 3.805  
 3rd Qu.:16.00   3rd Qu.:13.00   3rd Qu.: 6.000   3rd Qu.: 5.000  
 Max.   :30.00   Max.   :24.00   Max.   :14.000   Max.   :13.000  
                                                                  
       HF              AF              HC               AC        
 Min.   : 4.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:11.00   1st Qu.:11.00   1st Qu.: 4.000   1st Qu.: 2.000  
 Median :13.00   Median :14.00   Median : 5.000   Median : 4.000  
 Mean   :13.73   Mean   :13.95   Mean   : 5.613   Mean   : 4.192  
 3rd Qu.:17.00   3rd Qu.:17.00   3rd Qu.: 7.000   3rd Qu.: 6.000  
 Max.   :29.00   Max.   :29.00   Max.   :16.000   Max.   :14.000  
                                                                  
       HY              AY              HR               AR         
 Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :2.000   Median :3.000   Median :0.0000   Median :0.00000  
 Mean   :2.339   Mean   :2.676   Mean   :0.1105   Mean   :0.07895  
 3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :8.000   Max.   :9.000   Max.   :2.0000   Max.   :2.00000  
                                                                   
     B365H            B365D            B365A             BWH        
 Min.   : 1.050   Min.   : 2.790   Min.   : 1.170   Min.   : 1.050  
 1st Qu.: 1.617   1st Qu.: 3.290   1st Qu.: 2.600   1st Qu.: 1.650  
 Median : 2.075   Median : 3.500   Median : 3.700   Median : 2.100  
 Mean   : 2.777   Mean   : 4.259   Mean   : 5.192   Mean   : 2.744  
 3rd Qu.: 2.790   3rd Qu.: 4.330   3rd Qu.: 5.500   3rd Qu.: 2.750  
 Max.   :17.000   Max.   :15.000   Max.   :34.000   Max.   :14.500  
                                                                    
      BWD              BWA              IWH              IWD        
 Min.   : 2.950   Min.   : 1.180   Min.   : 1.070   Min.   : 3.050  
 1st Qu.: 3.300   1st Qu.: 2.600   1st Qu.: 1.650   1st Qu.: 3.300  
 Median : 3.600   Median : 3.700   Median : 2.100   Median : 3.500  
 Mean   : 4.278   Mean   : 5.204   Mean   : 2.721   Mean   : 4.161  
 3rd Qu.: 4.330   3rd Qu.: 5.500   3rd Qu.: 2.700   3rd Qu.: 4.200  
 Max.   :15.500   Max.   :34.000   Max.   :15.000   Max.   :12.000  
                                                                    
      IWA              LBH              LBD              LBA        
 Min.   : 1.200   Min.   : 1.050   Min.   : 2.900   Min.   : 1.170  
 1st Qu.: 2.600   1st Qu.: 1.610   1st Qu.: 3.250   1st Qu.: 2.575  
 Median : 3.500   Median : 2.050   Median : 3.500   Median : 3.600  
 Mean   : 5.041   Mean   : 2.742   Mean   : 4.152   Mean   : 5.375  
 3rd Qu.: 5.300   3rd Qu.: 2.750   3rd Qu.: 4.200   3rd Qu.: 5.500  
 Max.   :27.000   Max.   :19.000   Max.   :17.000   Max.   :41.000  
                  NA's   :1        NA's   :1        NA's   :1       
      PSH              PSD              PSA              WHH        
 Min.   : 1.050   Min.   : 3.020   Min.   : 1.180   Min.   : 1.060  
 1st Qu.: 1.660   1st Qu.: 3.410   1st Qu.: 2.670   1st Qu.: 1.665  
 Median : 2.120   Median : 3.705   Median : 3.845   Median : 2.100  
 Mean   : 2.857   Mean   : 4.539   Mean   : 5.522   Mean   : 2.738  
 3rd Qu.: 2.850   3rd Qu.: 4.455   3rd Qu.: 5.942   3rd Qu.: 2.750  
 Max.   :19.650   Max.   :20.380   Max.   :36.500   Max.   :17.000  
                                                                    
      WHD              WHA              VCH              VCD        
 Min.   : 2.900   Min.   : 1.170   Min.   : 1.040   Min.   : 3.000  
 1st Qu.: 3.250   1st Qu.: 2.600   1st Qu.: 1.650   1st Qu.: 3.400  
 Median : 3.500   Median : 3.550   Median : 2.100   Median : 3.700  
 Mean   : 4.092   Mean   : 5.041   Mean   : 2.762   Mean   : 4.416  
 3rd Qu.: 4.200   3rd Qu.: 5.500   3rd Qu.: 2.800   3rd Qu.: 4.400  
 Max.   :15.000   Max.   :26.000   Max.   :15.000   Max.   :17.000  
                                                                    
      VCA             Bb1X2           BbMxH            BbAvH       
 Min.   : 1.180   Min.   : 3.00   Min.   : 1.080   Min.   : 1.050  
 1st Qu.: 2.630   1st Qu.:35.00   1st Qu.: 1.700   1st Qu.: 1.640  
 Median : 3.700   Median :37.00   Median : 2.200   Median : 2.090  
 Mean   : 5.472   Mean   :37.71   Mean   : 2.966   Mean   : 2.743  
 3rd Qu.: 5.750   3rd Qu.:40.00   3rd Qu.: 2.882   3rd Qu.: 2.765  
 Max.   :36.000   Max.   :43.00   Max.   :19.650   Max.   :16.300  
                                                                   
     BbMxD            BbAvD            BbMxA            BbAvA       
 Min.   : 3.110   Min.   : 2.940   Min.   : 1.210   Min.   : 1.170  
 1st Qu.: 3.478   1st Qu.: 3.328   1st Qu.: 2.728   1st Qu.: 2.607  
 Median : 3.750   Median : 3.570   Median : 3.920   Median : 3.665  
 Mean   : 4.636   Mean   : 4.261   Mean   : 6.107   Mean   : 5.190  
 3rd Qu.: 4.553   3rd Qu.: 4.272   3rd Qu.: 6.105   3rd Qu.: 5.543  
 Max.   :20.380   Max.   :15.320   Max.   :67.000   Max.   :33.420  
                                                                    
      BbOU          BbMx.2.5        BbAv.2.5       BbMx.2.5.1   
 Min.   : 3.00   Min.   :1.130   Min.   :1.120   Min.   :1.470  
 1st Qu.:31.75   1st Qu.:1.667   1st Qu.:1.617   1st Qu.:1.780  
 Median :34.00   Median :1.960   Median :1.880   Median :2.000  
 Mean   :34.06   Mean   :1.950   Mean   :1.872   Mean   :2.284  
 3rd Qu.:37.00   3rd Qu.:2.203   3rd Qu.:2.120   3rd Qu.:2.402  
 Max.   :42.00   Max.   :3.080   Max.   :2.850   Max.   :7.000  
                                                                
   BbAv.2.5.1         BbAH           BbAHh            BbMxAHH     
 Min.   :1.410   Min.   : 1.00   Min.   :-3.2500   Min.   :1.610  
 1st Qu.:1.718   1st Qu.:17.00   1st Qu.:-0.7500   1st Qu.:1.890  
 Median :1.920   Median :18.00   Median :-0.2500   Median :1.985  
 Mean   :2.162   Mean   :18.16   Mean   :-0.4059   Mean   :1.988  
 3rd Qu.:2.283   3rd Qu.:19.00   3rd Qu.: 0.0625   3rd Qu.:2.070  
 Max.   :5.970   Max.   :24.00   Max.   : 2.0000   Max.   :2.420  
                                                                  
    BbAvAHH         BbMxAHA         BbAvAHA           PSCH       
 Min.   :1.580   Min.   :1.680   Min.   :1.630   Min.   : 1.060  
 1st Qu.:1.840   1st Qu.:1.897   1st Qu.:1.850   1st Qu.: 1.640  
 Median :1.930   Median :1.970   Median :1.930   Median : 2.120  
 Mean   :1.938   Mean   :1.988   Mean   :1.937   Mean   : 2.839  
 3rd Qu.:2.020   3rd Qu.:2.080   3rd Qu.:2.030   3rd Qu.: 2.980  
 Max.   :2.340   Max.   :2.520   Max.   :2.440   Max.   :18.700  
                                                 NA's   :1       
      PSCD             PSCA       
 Min.   : 2.930   Min.   : 1.160  
 1st Qu.: 3.410   1st Qu.: 2.590  
 Median : 3.700   Median : 3.850  
 Mean   : 4.508   Mean   : 5.695  
 3rd Qu.: 4.560   3rd Qu.: 6.095  
 Max.   :18.500   Max.   :46.000  
 NA's   :1        NA's   :1       
In [167]:
table(laliga$HomeTeam) #table for column Home Team
     Alaves  Ath Bilbao  Ath Madrid   Barcelona       Betis       Celta 
         19          19          19          19          19          19 
      Eibar     Espanol      Getafe      Girona   La Coruna  Las Palmas 
         19          19          19          19          19          19 
    Leganes     Levante      Malaga Real Madrid     Sevilla    Sociedad 
         19          19          19          19          19          19 
   Valencia  Villarreal 
         19          19 
In [169]:
str(laliga) #structure of the data
'data.frame':	380 obs. of  64 variables:
 $ Div       : Factor w/ 1 level "SP1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Date      : Factor w/ 137 levels "01/03/18","01/04/18",..: 75 75 83 83 83 90 90 90 97 97 ...
 $ HomeTeam  : Factor w/ 20 levels "Alaves","Ath Bilbao",..: 13 19 6 10 17 2 4 11 14 15 ...
 $ AwayTeam  : Factor w/ 20 levels "Alaves","Ath Bilbao",..: 1 12 18 3 8 9 5 16 20 7 ...
 $ FTHG      : int  1 1 2 2 1 0 2 0 1 0 ...
 $ FTAG      : int  0 0 3 2 1 0 0 3 0 1 ...
 $ FTR       : Factor w/ 3 levels "A","D","H": 3 3 1 2 2 2 3 1 3 1 ...
 $ HTHG      : int  1 1 1 2 1 0 2 0 0 0 ...
 $ HTAG      : int  0 0 1 0 1 0 0 2 0 0 ...
 $ HTR       : Factor w/ 3 levels "A","D","H": 3 3 2 3 2 2 3 1 2 2 ...
 $ HS        : int  16 22 16 13 9 12 15 12 14 10 ...
 $ AS        : int  6 5 13 9 9 8 3 16 9 13 ...
 $ HST       : int  9 6 5 6 4 2 2 6 3 4 ...
 $ AST       : int  3 4 6 3 6 2 0 8 1 6 ...
 $ HF        : int  14 25 12 15 14 16 16 16 18 16 ...
 $ AF        : int  18 13 11 15 12 15 15 12 14 15 ...
 $ HC        : int  4 5 5 6 7 7 8 4 11 3 ...
 $ AC        : int  2 2 4 0 3 6 0 4 6 7 ...
 $ HY        : int  0 3 3 2 2 1 2 5 1 2 ...
 $ AY        : int  1 3 1 4 4 3 1 1 3 3 ...
 $ HR        : int  0 0 0 0 1 0 0 0 0 0 ...
 $ AR        : int  0 1 0 1 0 1 0 1 0 0 ...
 $ B365H     : num  2.05 1.75 2.38 8 1.62 1.5 1.17 9.5 3.25 2.1 ...
 $ B365D     : num  3.2 3.8 3.25 4.33 4 4 8 5.75 3.25 3.3 ...
 $ B365A     : num  4.1 4.5 3.2 1.45 5.5 7.5 15 1.3 2.3 3.7 ...
 $ BWH       : num  2.05 1.75 2.4 7.5 1.62 1.48 1.18 9.25 3.25 2.15 ...
 $ BWD       : num  3.1 3.9 3.3 4.33 3.9 4.25 7.5 5.75 3.2 3.3 ...
 $ BWA       : num  4.1 4.6 3 1.45 5.75 7 14.5 1.3 2.3 3.5 ...
 $ IWH       : num  2.1 1.75 2.5 7.2 1.55 1.5 1.17 7.5 3.3 2.1 ...
 $ IWD       : num  3.4 3.6 3.3 4.4 4 4.2 7.5 5.5 3.35 3.4 ...
 $ IWA       : num  3.5 4.8 2.85 1.45 6.2 6.5 15 1.35 2.2 3.5 ...
 $ LBH       : num  2.05 1.75 2.35 7.5 1.6 1.5 1.2 9.5 3.25 2.1 ...
 $ LBD       : num  3 3.8 3.25 4 3.9 4 6.5 5.25 3.1 3.1 ...
 $ LBA       : num  4.2 4.33 3 1.5 5.5 7 15 1.3 2.3 3.4 ...
 $ PSH       : num  2.03 1.78 2.44 8.36 1.62 ...
 $ PSD       : num  3.25 4.01 3.4 4.38 4.17 4.37 7.35 5.79 3.24 3.36 ...
 $ PSA       : num  4.52 4.83 3.16 1.49 6.18 7.31 15.5 1.33 2.36 3.49 ...
 $ WHH       : num  2.05 1.8 2.4 8 1.67 1.5 1.22 11 3.1 2.2 ...
 $ WHD       : num  3.1 3.75 3.4 4.2 3.6 4 6 4.5 3.1 3.3 ...
 $ WHA       : num  4 4.2 2.9 1.44 5.5 7 13 1.33 2.4 3.3 ...
 $ VCH       : num  2.05 1.8 2.4 7.5 1.65 1.5 1.2 9.5 3.25 2.15 ...
 $ VCD       : num  3.2 4 3.4 4.3 4 4.2 7 5.75 3.25 3.3 ...
 $ VCA       : num  4.4 4.6 3.13 1.5 5.75 7 13 1.3 2.3 3.5 ...
 $ Bb1X2     : int  35 35 35 35 35 34 35 35 34 34 ...
 $ BbMxH     : num  2.12 1.83 2.5 8.36 1.69 ...
 $ BbAvH     : num  2.03 1.77 2.39 7.53 1.63 1.5 1.19 9.68 3.26 2.18 ...
 $ BbMxD     : num  3.4 4.04 3.5 4.4 4.17 4.4 8 5.86 3.35 3.4 ...
 $ BbAvD     : num  3.15 3.86 3.32 4.17 3.93 4.17 7.11 5.44 3.17 3.26 ...
 $ BbMxA     : num  4.52 4.83 3.2 1.51 6.2 7.5 17 1.35 2.4 3.7 ...
 $ BbAvA     : num  4.17 4.46 3.01 1.48 5.58 ...
 $ BbOU      : int  31 33 34 34 33 32 27 27 32 32 ...
 $ BbMx.2.5  : num  2.84 1.69 2.03 2.2 1.81 2.01 1.44 1.5 2.42 2.25 ...
 $ BbAv.2.5  : num  2.68 1.64 1.98 2.11 1.75 1.94 1.4 1.46 2.36 2.14 ...
 $ BbMx.2.5.1: num  1.53 2.4 1.9 1.8 2.14 1.96 3.1 2.95 1.63 1.76 ...
 $ BbAv.2.5.1: num  1.46 2.27 1.84 1.74 2.09 1.87 2.88 2.64 1.58 1.7 ...
 $ BbAH      : int  18 16 18 16 16 17 17 16 15 17 ...
 $ BbAHh     : num  -0.5 -0.75 -0.25 1.25 -1 -1 -2 1.5 0.25 -0.25 ...
 $ BbMxAHH   : num  2.07 2.05 2.08 1.77 2.12 1.9 2.05 2.03 1.93 1.92 ...
 $ BbAvAHH   : num  2.03 1.97 2.05 1.75 2.06 1.86 2 1.98 1.89 1.88 ...
 $ BbMxAHA   : num  1.9 1.96 1.87 2.25 1.86 2.05 1.91 1.95 2.03 2.04 ...
 $ BbAvAHA   : num  1.86 1.91 1.83 2.16 1.82 2.01 1.86 1.89 1.98 1.99 ...
 $ PSCH      : num  1.98 1.78 2.12 6.93 1.64 1.53 1.2 12.4 3.31 2.2 ...
 $ PSCD      : num  3.35 4.24 3.53 3.83 4.18 4.48 8.25 7 3.32 3.27 ...
 $ PSCA      : num  4.63 4.43 3.74 1.63 5.82 6.91 15.2 1.26 2.4 3.85 ...

Base Graphics

In [176]:
data(cars)
In [178]:
head(cars)
speed dist
    4    2
    4   10
    7    4
    7   22
    8   16
    9   10
In [199]:
options(repr.plot.width=4, repr.plot.height=4) #reduce the plot size; otherwise it fills the screen
plot(cars) #uses the first column as the x-axis and the second as the y-axis
In [200]:
plot(x=cars$speed, y=cars$dist) #specifying the axis
In [201]:
plot(y=cars$speed, x=cars$dist) #switching the axes from above
In [202]:
plot(x=cars$speed, y=cars$dist, xlab= "Speed") #labeling the x-axis
In [203]:
plot(x=cars$speed, y=cars$dist, xlab= "Speed",ylab = "Stopping Distance") #labeling the y-axis
In [197]:
plot(cars,main="My Plot") #title
In [204]:
plot(cars,sub="My Plot Subtitle") #sub title
In [206]:
plot(cars,col=2) #change color for the points
In [207]:
plot(cars,xlim=c(10,15)) #limiting the x-axis
In [208]:
plot(cars,pch=2) #changing the point symbol
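The arguments shown one at a time above can also be combined in a single call. The following is a minimal sketch using the built-in cars data; the title text, color and point symbol are arbitrary choices for illustration, not part of the original notebook.

```r
# Combine several plot() arguments in one call (illustrative values only).
data(cars)
plot(x = cars$speed, y = cars$dist,
     main = "Stopping Distance vs. Speed",  # title (hypothetical wording)
     xlab = "Speed",
     ylab = "Stopping Distance",
     col = 2,   # second color of the current palette (a red by default)
     pch = 2)   # point symbol 2 is an open triangle
```

Each argument is independent, so any of them can be dropped without affecting the others.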
In [209]:
data(mtcars) #loading the mtcars data set
In [211]:
str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
In [212]:
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
In [210]:
boxplot(mpg ~ cyl , data=mtcars) #box plot
In [213]:
hist(mtcars$mpg) #histogram
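In the boxplot call, the formula mpg ~ cyl reads as "mpg grouped by cyl", so one box is drawn per cylinder count. As a sketch, both plots can be labeled with the same main, xlab and ylab arguments used for plot(); the label text and the breaks value below are assumptions chosen for illustration.

```r
# Labeled versions of the box plot and histogram (label text is illustrative).
data(mtcars)
boxplot(mpg ~ cyl, data = mtcars,
        main = "Fuel economy by cylinder count",
        xlab = "Number of cylinders",
        ylab = "Miles per gallon")
hist(mtcars$mpg,
     breaks = 10,               # suggested number of bins; R may adjust it
     main = "Distribution of mpg",
     xlab = "Miles per gallon")
```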

Manipulating Data with dplyr

dplyr is a fast and powerful R package written by Hadley Wickham and Romain Francois that provides a consistent and concise grammar for manipulating tabular data.

In [215]:
library("dplyr") #loading the package
In [216]:
packageVersion("dplyr") #check the version
[1] ‘0.7.4’
In [217]:
mydf=read.csv("SP1.csv") #reading the data set to mydf
In [218]:
cran<-tbl_df(mydf) #"The main advantage to using a tbl_df over a regular data frame is the printing."
In [219]:
cran #the Jupyter notebook doesn't display a tbl_df well
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR BbAv.2.5.1 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH PSCD PSCA
SP1 18/08/17 Leganes Alaves 1 0 H 1 0 H 1.46 18 -0.50 2.07 2.03 1.90 1.86 1.98 3.35 4.63
SP1 18/08/17 Valencia Las Palmas 1 0 H 1 0 H 2.27 16 -0.75 2.05 1.97 1.96 1.91 1.78 4.24 4.43
SP1 19/08/17 Celta Sociedad 2 3 A 1 1 D 1.84 18 -0.25 2.08 2.05 1.87 1.83 2.12 3.53 3.74
SP1 19/08/17 Girona Ath Madrid 2 2 D 2 0 H 1.74 16 1.25 1.77 1.75 2.25 2.16 6.93 3.83 1.63
SP1 19/08/17 Sevilla Espanol 1 1 D 1 1 D 2.09 16 -1.00 2.12 2.06 1.86 1.82 1.64 4.18 5.82
SP1 20/08/17 Ath Bilbao Getafe 0 0 D 0 0 D 1.87 17 -1.00 1.90 1.86 2.05 2.01 1.53 4.48 6.91
SP1 20/08/17 Barcelona Betis 2 0 H 2 0 H 2.88 17 -2.00 2.05 2.00 1.91 1.86 1.20 8.25 15.20
SP1 20/08/17 La Coruna Real Madrid 0 3 A 0 2 A 2.64 16 1.50 2.03 1.98 1.95 1.89 12.40 7.00 1.26
SP1 21/08/17 Levante Villarreal 1 0 H 0 0 D 1.58 15 0.25 1.93 1.89 2.03 1.98 3.31 3.32 2.40
SP1 21/08/17 Malaga Eibar 0 1 A 0 0 D 1.70 17 -0.25 1.92 1.88 2.04 1.99 2.20 3.27 3.85
SP1 25/08/17 Betis Celta 2 1 H 1 1 D 1.85 16 -0.25 2.05 1.96 1.96 1.93 2.32 3.44 3.35
SP1 25/08/17 Sociedad Villarreal 3 0 H 3 0 H 1.76 16 -0.25 1.83 1.80 2.14 2.10 2.04 3.59 3.98
SP1 26/08/17 Alaves Barcelona 0 2 A 0 0 D 2.38 14 1.50 2.05 2.00 1.92 1.89 10.13 5.83 1.33
SP1 26/08/17 Girona Malaga 1 0 H 1 0 H 1.67 14 -0.25 1.88 1.85 2.08 2.03 2.01 3.42 4.33
SP1 26/08/17 Las Palmas Ath Madrid 1 5 A 0 2 A 1.74 14 0.75 1.89 1.86 2.08 2.02 5.31 3.75 1.77
SP1 26/08/17 Levante La Coruna 2 2 D 1 2 A 1.61 14 -0.25 1.93 1.89 2.04 1.99 2.19 3.40 3.71
SP1 27/08/17 Eibar Ath Bilbao 0 1 A 0 1 A 1.67 14 -0.25 2.20 2.17 1.78 1.74 2.52 3.21 3.20
SP1 27/08/17 Espanol Leganes 0 1 A 0 1 A 1.53 14 -0.50 2.00 1.96 1.97 1.93 2.05 3.24 4.47
SP1 27/08/17 Getafe Sevilla 0 1 A 0 0 D 1.93 14 0.25 2.06 2.03 1.90 1.86 3.58 3.50 2.20
SP1 27/08/17 Real Madrid Valencia 2 2 D 1 1 D 4.20 14 -2.25 2.06 2.02 1.90 1.87 1.15 10.55 17.70
SP1 08/09/17 Leganes Getafe 1 2 A 0 1 A 1.56 15 -0.50 2.14 2.09 1.84 1.81 2.13 3.22 4.17
SP1 09/09/17 Barcelona Espanol 5 0 H 2 0 H 3.19 14 -2.00 1.83 1.78 2.17 2.11 1.13 10.40 23.13
SP1 09/09/17 Real Madrid Levante 1 1 D 1 1 D 4.86 15 -3.00 2.07 2.01 1.90 1.86 1.11 12.60 23.75
SP1 09/09/17 Sevilla Eibar 3 0 H 0 0 D 2.16 14 -1.00 2.08 2.04 1.88 1.85 1.57 4.46 6.28
SP1 09/09/17 Valencia Ath Madrid 0 0 D 0 0 D 1.73 15 0.25 2.03 1.99 1.93 1.89 3.21 3.23 2.50
SP1 10/09/17 Ath Bilbao Girona 2 0 H 1 0 H 1.86 14 -0.75 1.92 1.88 2.03 1.99 1.61 4.04 6.67
SP1 10/09/17 Celta Alaves 1 0 H 1 0 H 1.81 14 -0.75 1.99 1.95 1.95 1.92 1.68 3.91 5.87
SP1 10/09/17 La Coruna Sociedad 2 4 A 1 2 A 1.86 15 0.00 2.03 1.99 1.91 1.88 3.12 3.55 2.39
SP1 10/09/17 Villarreal Betis 3 1 H 1 1 D 1.80 14 -0.50 1.92 1.88 2.04 2.01 1.97 3.62 4.25
SP1 11/09/17 Malaga Las Palmas 1 3 A 0 1 A 1.93 12 -0.50 2.07 2.04 1.87 1.86 2.03 3.59 4.02
SP1 05/05/18 Celta La Coruna 1 1 D 1 0 H 2.40 21 -1.00 2.13 2.07 1.84 1.80 1.82 3.95 4.49
SP1 05/05/18 Girona Eibar 1 4 A 0 2 A 1.65 20 -0.25 1.83 1.79 2.14 2.08 2.16 3.42 3.71
SP1 05/05/18 Villarreal Valencia 1 0 H 0 0 D 2.06 20 -0.25 1.98 1.93 1.96 1.92 2.49 3.50 2.95
SP1 06/05/18 Ath Madrid Espanol 0 2 A 0 0 D 1.69 20 -1.25 2.03 1.98 1.92 1.88 1.56 3.70 8.44
SP1 06/05/18 Barcelona Real Madrid 2 2 D 1 1 D 3.22 20 -0.75 1.87 1.82 2.11 2.04 1.75 4.64 4.18
SP1 06/05/18 Las Palmas Getafe 0 1 A 0 0 D 1.75 18 0.25 2.17 2.10 1.81 1.76 3.17 3.31 2.46
SP1 06/05/18 Malaga Alaves 0 3 A 0 1 A 1.58 19 -0.25 2.10 2.05 1.85 1.82 2.27 3.31 3.54
SP1 07/05/18 Leganes Levante 0 3 A 0 0 D 1.62 19 -0.25 1.85 1.81 2.12 2.05 2.26 3.24 3.63
SP1 09/05/18 Barcelona Villarreal 5 1 H 3 0 H 3.40 20 -1.75 2.05 1.98 1.93 1.87 1.36 5.06 9.17
SP1 09/05/18 Sevilla Real Madrid 3 2 H 2 0 H 2.85 19 -0.25 2.20 2.13 1.79 1.75 2.14 4.16 3.15
SP1 12/05/18 Alaves Ath Bilbao 3 1 H 1 0 H 1.64 17 -0.25 2.23 2.16 1.77 1.72 2.98 3.29 2.59
SP1 12/05/18 Betis Sevilla 2 2 D 1 0 H 2.35 18 0.00 1.99 1.92 1.99 1.93 3.01 3.73 2.36
SP1 12/05/18 Eibar Las Palmas 1 0 H 1 0 H 2.19 19 -1.25 1.93 1.88 2.02 1.97 1.42 5.10 8.04
SP1 12/05/18 Getafe Ath Madrid 0 1 A 0 1 A 1.46 18 0.00 1.95 1.90 2.10 1.96 3.72 3.04 2.34
SP1 12/05/18 Girona Valencia 0 1 A 0 0 D 2.13 18 0.00 1.93 1.88 2.03 1.97 2.13 3.70 3.50
SP1 12/05/18 La Coruna Villarreal 2 4 A 0 3 A 2.15 17 0.25 2.19 2.11 1.81 1.76 4.71 4.30 1.71
SP1 12/05/18 Real Madrid Celta 6 0 H 3 0 H 3.94 19 -1.50 1.96 1.91 2.00 1.95 1.25 6.89 11.56
SP1 12/05/18 Sociedad Leganes 3 2 H 2 1 H 2.04 19 -1.25 2.06 1.99 1.91 1.87 1.40 4.87 9.26
SP1 13/05/18 Espanol Malaga 4 1 H 3 1 H 1.74 19 -0.75 1.86 1.83 2.07 2.03 1.63 3.97 6.24
SP1 13/05/18 Levante Barcelona 5 4 H 2 1 H 3.23 18 1.50 2.11 2.06 1.86 1.81 7.70 5.40 1.40
SP1 19/05/18 Celta Levante 4 2 H 2 1 H 2.82 20 -0.75 2.05 2.01 1.90 1.85 1.60 4.68 5.33
SP1 19/05/18 Las Palmas Girona 1 2 A 1 2 A 2.36 19 0.25 1.94 1.90 2.00 1.95 3.44 4.01 2.06
SP1 19/05/18 Leganes Betis 3 2 H 1 1 D 2.19 17 0.00 2.04 1.99 1.91 1.86 2.41 3.76 2.90
SP1 19/05/18 Malaga Getafe 0 1 A 0 0 D 1.91 19 0.25 1.98 1.92 1.98 1.93 3.26 3.56 2.28
SP1 19/05/18 Sevilla Alaves 1 0 H 1 0 H 2.86 19 -1.25 1.97 1.91 2.00 1.94 1.32 6.09 9.47
SP1 19/05/18 Villarreal Real Madrid 2 2 D 0 2 A 3.79 19 0.25 2.05 1.98 1.93 1.87 4.74 5.05 1.62
SP1 20/05/18 Ath Bilbao Espanol 0 1 A 0 1 A 2.06 17 -0.50 2.06 2.02 1.88 1.85 1.95 3.77 4.05
SP1 20/05/18 Ath Madrid Eibar 2 2 D 1 1 D 1.98 19 -1.00 2.09 2.04 1.87 1.82 1.47 4.25 8.80
SP1 20/05/18 Barcelona Sociedad 1 0 H 0 0 D 5.04 19 -2.00 1.94 1.89 2.03 1.97 1.31 6.40 8.60
SP1 20/05/18 Valencia La Coruna 2 1 H 1 0 H 2.98 19 -1.50 2.01 1.97 1.94 1.89 1.27 6.44 10.71

For comparison, this is how a tbl_df prints in the R console (example output from a different data set):

# A tibble: 225,468 x 11
       X date       time        size r_version r_arch r_os      package      version country ip_id
 1     1 2014-07-08 00:54:41   80589 3.1.0     x86_64 mingw32   htmltools    0.2.4   US          1
 2     2 2014-07-08 00:59:53  321767 3.1.0     x86_64 mingw32   tseries      0.10-32 US          2
 3     3 2014-07-08 00:47:13  748063 3.1.0     x86_64 linux-gnu party        1.0-15  US          3
 4     4 2014-07-08 00:48:05  606104 3.1.0     x86_64 linux-gnu Hmisc        3.14-4  US          3
 5     5 2014-07-08 00:46:50   79825 3.0.2     x86_64 linux-gnu digest       0.6.4   CA          4
 6     6 2014-07-08 00:48:04   77681 3.1.0     x86_64 linux-gnu randomForest 4.6-7   US          3
 7     7 2014-07-08 00:48:35  393754 3.1.0     x86_64 linux-gnu plyr         1.8.1   US          3
 8     8 2014-07-08 00:47:30   28216 3.0.2     x86_64 linux-gnu whisker      0.3-2   US          5
 9     9 2014-07-08 00:54:58    5928 NA        NA     NA        Rcpp         0.10.4  CN          6
10    10 2014-07-08 00:15:35 2206029 3.0.2     x86_64 linux-gnu hflights     0.1     US          7
# ... with 225,458 more rows

Specifically, dplyr supplies five 'verbs' that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().

In [225]:
head(select(cran,HomeTeam,AwayTeam,FTAG,FTHG)) #select only the columns needed; the order specified is preserved
HomeTeam   AwayTeam   FTAG FTHG
Leganes    Alaves     0    1
Valencia   Las Palmas 0    1
Celta      Sociedad   3    2
Girona     Ath Madrid 2    2
Sevilla    Espanol    1    1
Ath Bilbao Getafe     0    0
In [226]:
head(select(cran,HomeTeam:FTR)) #selects all columns from HomeTeam to FTR
HomeTeam   AwayTeam   FTHG FTAG FTR
Leganes    Alaves     1    0    H
Valencia   Las Palmas 1    0    H
Celta      Sociedad   2    3    A
Girona     Ath Madrid 2    2    D
Sevilla    Espanol    1    1    D
Ath Bilbao Getafe     0    0    D
In [227]:
head(select(cran,FTR:HomeTeam)) #also possible in reverse order
FTR FTAG FTHG AwayTeam   HomeTeam
H   0    1    Alaves     Leganes
H   0    1    Las Palmas Valencia
A   3    2    Sociedad   Celta
D   2    2    Ath Madrid Girona
D   1    1    Espanol    Sevilla
D   0    0    Getafe     Ath Bilbao
In [229]:
head(select(cran,HomeTeam:FTR, -FTAG)) # a minus sign before a column name excludes that column
HomeTeam   AwayTeam   FTHG FTR
Leganes    Alaves     1    H
Valencia   Las Palmas 1    H
Celta      Sociedad   2    A
Girona     Ath Madrid 2    D
Sevilla    Espanol    1    D
Ath Bilbao Getafe     0    D
In [239]:
cran_sub<-select(cran, -(HS:PSCA), -Div) #removes all columns from HS to PSCA, plus Div
head(cran_sub)
Date     HomeTeam   AwayTeam   FTHG FTAG FTR HTHG HTAG HTR
18/08/17 Leganes    Alaves     1    0    H   1    0    H
18/08/17 Valencia   Las Palmas 1    0    H   1    0    H
19/08/17 Celta      Sociedad   2    3    A   1    1    D
19/08/17 Girona     Ath Madrid 2    2    D   2    0    H
19/08/17 Sevilla    Espanol    1    1    D   1    1    D
20/08/17 Ath Bilbao Getafe     0    0    D   0    0    D

"How do I select a subset of rows?" That's where the filter() function comes in.

In [242]:
head(filter(cran_sub, HomeTeam == "Barcelona")) #only rows where HomeTeam is Barcelona
Date     HomeTeam  AwayTeam   FTHG FTAG FTR HTHG HTAG HTR
20/08/17 Barcelona Betis      2    0    H   2    0    H
09/09/17 Barcelona Espanol    5    0    H   2    0    H
19/09/17 Barcelona Eibar      6    1    H   2    0    H
01/10/17 Barcelona Las Palmas 3    0    H   0    0    D
21/10/17 Barcelona Malaga     2    0    H   1    0    H
04/11/17 Barcelona Sevilla    2    1    H   1    0    H
In [249]:
filter(cran_sub, HomeTeam == "Barcelona", FTR== "D") # rows where HomeTeam is Barcelona and FTR is D (draw at home)
Date     HomeTeam  AwayTeam    FTHG FTAG FTR HTHG HTAG HTR
02/12/17 Barcelona Celta       2    2    D   1    1    D
11/02/18 Barcelona Getafe      0    0    D   0    0    D
06/05/18 Barcelona Real Madrid 2    2    D   1    1    D
In [252]:
filter(cran_sub, HomeTeam == "Barcelona", FTHG>3) # adding a comparison operator:
                                                    # rows where HomeTeam is Barcelona and FTHG is more than 3
                                                    # (matches in which Barcelona scored more than 3 goals at home)
Date     HomeTeam  AwayTeam   FTHG FTAG FTR HTHG HTAG HTR
09/09/17 Barcelona Espanol    5    0    H   2    0    H
19/09/17 Barcelona Eibar      6    1    H   2    0    H
17/12/17 Barcelona La Coruna  4    0    H   2    0    H
24/02/18 Barcelona Girona     6    1    H   4    1    H
09/05/18 Barcelona Villarreal 5    1    H   3    0    H
In [261]:
head(filter(cran_sub, AwayTeam == "Barcelona" | HomeTeam =="Barcelona")) #rows where either the home or the away team is
                                                                        #Barcelona
Date     HomeTeam  AwayTeam  FTHG FTAG FTR HTHG HTAG HTR
20/08/17 Barcelona Betis     2    0    H   2    0    H
26/08/17 Alaves    Barcelona 0    2    A   0    0    D
09/09/17 Barcelona Espanol   5    0    H   2    0    H
16/09/17 Getafe    Barcelona 1    2    A   1    0    H
19/09/17 Barcelona Eibar     6    1    H   2    0    H
23/09/17 Girona    Barcelona 0    3    A   0    1    A
In [258]:
filter(cran_sub, AwayTeam == "Barcelona", HTHG>HTAG , FTAG>=FTHG ) #Barcelona away games where they trailed at half time
                                                                    # but won or drew by full time
Date     HomeTeam   AwayTeam  FTHG FTAG FTR HTHG HTAG HTR
16/09/17 Getafe     Barcelona 1    2    A   1    0    H
14/10/17 Ath Madrid Barcelona 1    1    D   1    0    H
14/01/18 Sociedad   Barcelona 2    4    A   2    1    H
31/03/18 Sevilla    Barcelona 2    2    D   1    0    H
In [264]:
filter(cran_sub, is.na(FTHG)) #rows where FTHG is missing; none are returned, so FTHG has no missing values
Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR
In [268]:
head(filter(cran_sub, !is.na(FTHG))) #!is.na() keeps only the rows where FTHG is not NA
Date     HomeTeam   AwayTeam   FTHG FTAG FTR HTHG HTAG HTR
18/08/17 Leganes    Alaves     1    0    H   1    0    H
18/08/17 Valencia   Las Palmas 1    0    H   1    0    H
19/08/17 Celta      Sociedad   2    3    A   1    1    D
19/08/17 Girona     Ath Madrid 2    2    D   2    0    H
19/08/17 Sevilla    Espanol    1    1    D   1    1    D
20/08/17 Ath Bilbao Getafe     0    0    D   0    0    D
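Because the La Liga data has no missing FTHG values, the two is.na() filters above are hard to tell apart. The sketch below uses a small made-up data frame (the name toy and its contents are hypothetical) in which one value is missing, so the effect of each filter is visible.

```r
library(dplyr)

# Hypothetical toy data: team B has a missing home-goal count.
toy <- data.frame(HomeTeam = c("A", "B", "C"),
                  FTHG     = c(2, NA, 0))

missing_rows  <- filter(toy, is.na(FTHG))    # keeps only the row with NA
complete_rows <- filter(toy, !is.na(FTHG))   # keeps the rows without NA

missing_rows
complete_rows
```

Note that filter() does not modify toy; it returns a new data frame each time.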

arrange() sorts the rows of a data frame by the values of one or more columns.

In [271]:
head(arrange(cran_sub, FTHG)) #sorts rows by FTHG in ascending order
Date     HomeTeam   AwayTeam    FTHG FTAG FTR HTHG HTAG HTR
20/08/17 Ath Bilbao Getafe      0    0    D   0    0    D
20/08/17 La Coruna  Real Madrid 0    3    A   0    2    A
21/08/17 Malaga     Eibar       0    1    A   0    0    D
26/08/17 Alaves     Barcelona   0    2    A   0    0    D
27/08/17 Eibar      Ath Bilbao  0    1    A   0    1    A
27/08/17 Espanol    Leganes     0    1    A   0    1    A
In [273]:
head(arrange(cran_sub, desc(FTHG))) #desc() sorts in descending order
Date     HomeTeam    AwayTeam   FTHG FTAG FTR HTHG HTAG HTR
21/01/18 Real Madrid La Coruna  7    1    H   2    1    H
19/09/17 Barcelona   Eibar      6    1    H   2    0    H
13/01/18 Girona      Las Palmas 6    0    H   1    0    H
24/02/18 Barcelona   Girona     6    1    H   4    1    H
18/03/18 Real Madrid Girona     6    3    H   1    1    D
12/05/18 Real Madrid Celta      6    0    H   3    0    H
In [276]:
head(arrange(cran_sub,  HomeTeam, desc(FTHG))) #sorts first by HomeTeam ascending, then by FTHG descending
Date     HomeTeam AwayTeam    FTHG FTAG FTR HTHG HTAG HTR
12/05/18 Alaves   Ath Bilbao  3    1    H   1    0    H
08/12/17 Alaves   Las Palmas  2    0    H   1    0    H
21/01/18 Alaves   Leganes     2    2    D   0    0    D
03/02/18 Alaves   Celta       2    1    H   2    0    H
07/04/18 Alaves   Getafe      2    0    H   0    0    D
23/09/17 Alaves   Real Madrid 1    2    A   1    2    A

It's common to create a new variable based on the value of one or more variables already in a dataset. The mutate() function does exactly this.

In [284]:
cran_GD <- mutate(cran_sub, GD = FTHG-FTAG)
head(cran_GD)                               #creates a new column GD = FTHG - FTAG
Date     HomeTeam   AwayTeam   FTHG FTAG FTR HTHG HTAG HTR GD
18/08/17 Leganes    Alaves     1    0    H   1    0    H   1
18/08/17 Valencia   Las Palmas 1    0    H   1    0    H   1
19/08/17 Celta      Sociedad   2    3    A   1    1    D   -1
19/08/17 Girona     Ath Madrid 2    2    D   2    0    H   0
19/08/17 Sevilla    Espanol    1    1    D   1    1    D   0
20/08/17 Ath Bilbao Getafe     0    0    D   0    0    D   0
In [278]:
#similarly, you can add, subtract, multiply or divide column values to create new columns
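As a sketch of that idea, a single mutate() call can create several columns at once, and a later column can use one defined earlier in the same call. The toy data frame and the column names total_goals and abs_GD are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: full-time home and away goals for three matches.
toy <- data.frame(FTHG = c(1, 2, 0),
                  FTAG = c(0, 3, 0))

toy2 <- mutate(toy,
               GD          = FTHG - FTAG,   # goal difference
               total_goals = FTHG + FTAG,   # goals scored in the match
               abs_GD      = abs(GD))       # uses GD, created just above
toy2
```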

summarize() (also spelled summarise()) collapses a data frame into a single row of summary values.

In [280]:
summarise(cran_sub, AHG = mean(FTHG)) #collapses the column to a single summary value:
                                     #average home goals per match
AHG
1.547368
In [281]:
summarise(cran_sub, AAG = mean(FTAG)) #average away goals per match
AAG
1.147368
In [287]:
summarise(cran_GD, AGD = mean(abs(GD))) #average absolute goal difference
AGD
1.421053
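The three summaries above each take one call, but summarise() accepts several name = expression pairs at once. A minimal sketch on made-up data follows; the data frame toy and the column name max_home are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: four matches.
toy <- data.frame(FTHG = c(1, 2, 0, 3),
                  FTAG = c(0, 3, 0, 1))

s <- summarise(toy,
               matches  = n(),          # number of rows
               AHG      = mean(FTHG),   # average home goals
               AAG      = mean(FTAG),   # average away goals
               max_home = max(FTHG))    # most home goals in one match
s
```

The result is a one-row data frame with one column per summary.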

Grouping and Chaining with dplyr

In [4]:
library("dplyr")
In [6]:
mydf=read.csv("SP1.csv")
In [11]:
cran<-tbl_df(mydf)
cran_sub<-select(cran, -(HS:PSCA), -Div)
head(cran_sub)
Date     HomeTeam   AwayTeam   FTHG FTAG FTR HTHG HTAG HTR
18/08/17 Leganes    Alaves     1    0    H   1    0    H
18/08/17 Valencia   Las Palmas 1    0    H   1    0    H
19/08/17 Celta      Sociedad   2    3    A   1    1    D
19/08/17 Girona     Ath Madrid 2    2    D   2    0    H
19/08/17 Sevilla    Espanol    1    1    D   1    1    D
20/08/17 Ath Bilbao Getafe     0    0    D   0    0    D
In [12]:
summarise(cran_sub, count =n())
count
380
In [42]:
by_team <- group_by(cran_sub, HomeTeam) # group_by() is a key function for data analysis
team_sum = summarise(by_team, count =n(), unique = n_distinct(FTHG), avg_hg = mean(FTHG))
head(team_sum)
#every team played 19 home games; unique is the number of distinct home-goal totals, avg_hg the average home goals
HomeTeam   count unique avg_hg
Alaves     19    4      1.105263
Ath Bilbao 19    3      1.000000
Ath Madrid 19    5      1.578947
Barcelona  19    7      2.789474
Betis      19    5      1.842105
Celta      19    5      1.789474

n() returns the number of rows in each group; n_distinct() returns the number of distinct values.
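The difference between the two is easiest to see on a tiny data set. The sketch below uses a made-up data frame (toy is hypothetical) in which team A scored the same number of goals twice, so its count and its number of distinct values differ.

```r
library(dplyr)

# Hypothetical toy data: team A has three home games, team B has two.
toy <- data.frame(HomeTeam = c("A", "A", "A", "B", "B"),
                  FTHG     = c(1, 1, 2, 0, 3))

res <- toy %>%
  group_by(HomeTeam) %>%
  summarise(count  = n(),               # rows per team
            unique = n_distinct(FTHG),  # distinct FTHG values per team
            avg_hg = mean(FTHG))
res
```

Team A has count 3 but unique 2, because the value 1 appears twice.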

In [18]:
# | We need to know the value of 'count' that splits the data into
# | the top 1% and bottom 99% of packages based on total
# | downloads. In statistics, this is called the 0.99, or 99%,
# | sample quantile. Use quantile(pack_sum$count, probs = 0.99) to
# | determine this number.
In [24]:
quantile(team_sum$avg_hg, probs = 0.90) #90th percentile: teams averaging more than about 2.5 home goals are in the top 10%
90%: 2.50526315789474
In [26]:
filter(team_sum, avg_hg >2.5) #only Real Madrid and Barcelona are above the 90th percentile
HomeTeam    count unique avg_hg
Barcelona   19    7      2.789474
Real Madrid 19    8      2.842105
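Rather than typing the threshold by hand, the value returned by quantile() can be stored and reused. This is a minimal base R sketch on a made-up vector (x and q90 are hypothetical names), using the default quantile method.

```r
# Hypothetical toy vector of 11 evenly spaced values.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

q90 <- quantile(x, probs = 0.90)   # 90th percentile, default (type 7) method
top <- x[x > q90]                  # values strictly above the 90th percentile

q90
top
```

The same pattern would apply to a data frame column, for example by passing the stored quantile to filter() instead of a literal number.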

Use View(team_sum) to view the table interactively (not yet supported in the Jupyter R kernel).

In [29]:
arrange(filter(team_sum, avg_hg >2.5), desc(avg_hg)) #filter, then sort by avg_hg in descending order
HomeTeam    count unique avg_hg
Real Madrid 19    8      2.842105
Barcelona   19    7      2.789474
In [30]:
# | In this script, we've used a special chaining operator, %>%,
# | which was originally introduced in the magrittr R package and
# | has now become a key component of dplyr. You can pull up the
# | related documentation with ?chain. The benefit of %>% is that
# | it allows us to chain the function calls in a linear fashion.
# | The code to the right of %>% operates on the result from the
# | code to the left of %>%.

The same operation as above, written with %>%:

In [41]:
cran_sub %>% group_by(HomeTeam) %>%
summarise(count =n(), unique = n_distinct(FTHG), avg_hg = mean(FTHG)) %>% 
filter(avg_hg >2.5) %>% 
arrange(desc(avg_hg)) 

#1. group by HomeTeam
#2. summarise the data
#3. filter based on a condition
#4. arrange
#all without saving intermediate variables, and in a linear fashion
HomeTeam    count unique avg_hg
Real Madrid 19    8      2.842105
Barcelona   19    7      2.789474
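To make the equivalence concrete, the sketch below runs the same four steps twice on a small made-up data frame, once as nested calls and once as a %>% chain, and checks that the results match. The data frame toy and the threshold of 1 are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: three teams, two home games each.
toy <- data.frame(HomeTeam = c("A", "A", "B", "B", "C", "C"),
                  FTHG     = c(3, 4, 1, 0, 2, 3))

# Nested form: read from the inside out.
nested <- arrange(
  filter(
    summarise(group_by(toy, HomeTeam), avg_hg = mean(FTHG)),
    avg_hg > 1),
  desc(avg_hg))

# Piped form: read from top to bottom.
piped <- toy %>%
  group_by(HomeTeam) %>%
  summarise(avg_hg = mean(FTHG)) %>%
  filter(avg_hg > 1) %>%
  arrange(desc(avg_hg))

all.equal(as.data.frame(nested), as.data.frame(piped))
```

Both forms do the same work; the piped version only changes the order in which the steps are written.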